Monitoring and Observability

Core concepts behind monitoring, alerting, and observability for self-hosted systems

created: 2026-03-14 (UTC)
updated: 2026-03-14 (UTC)
tags: #monitoring #observability #operations

Summary

Monitoring and observability provide visibility into system health, failure modes, and operational behavior. For self-hosted systems, they turn infrastructure from a black box into an environment whose state can be inspected and maintained deliberately.

Why it matters

Without visibility, teams discover failures only after users notice them. Observability shortens diagnosis time, helps confirm that changes behave as intended, and supports day-two operations such as capacity planning and backup validation.

Core concepts

  • Metrics: numerical measurements over time
  • Logs: event records produced by systems and applications
  • Traces: records of a single request's path across components
  • Alerting: notifications triggered by actionable failure conditions
  • Service-level thinking: monitoring what users experience, not only host resource usage
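
As an illustration of service-level thinking, the sketch below derives an availability ratio from a window of endpoint probe results instead of looking at host resources. The `ProbeResult` shape, the function names, and the 0.99 default target are assumptions made for this example and do not come from any particular tool.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class ProbeResult:
    """One probe against a user-facing endpoint (hypothetical shape)."""
    timestamp: float   # Unix time of the probe, in seconds
    ok: bool           # True if the endpoint answered as expected
    latency_ms: float  # Observed response time in milliseconds


def availability(results: Sequence[ProbeResult]) -> Optional[float]:
    """Return the fraction of successful probes, or None when there is no data."""
    if not results:
        return None
    return sum(1 for r in results if r.ok) / len(results)


def meets_target(results: Sequence[ProbeResult], target: float = 0.99) -> bool:
    """Return True if availability over the window reaches the assumed target.

    An empty window counts as a miss: no data is not evidence of health.
    """
    ratio = availability(results)
    return ratio is not None and ratio >= target
```

Treating an empty window as a failure is a deliberate choice in this sketch, since a silent probe is itself a condition worth noticing.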

Practical usage

A practical starting point often includes:

  • Host metrics from exporters
  • Availability checks for critical endpoints
  • Dashboards for infrastructure and core services
  • Alerts for outages, storage pressure, certificate expiry, and failed backups
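
Two of the alert conditions listed above, certificate expiry and storage pressure, can be sketched with the Python standard library alone. The function names and the thresholds (14 days, 90% disk usage) are illustrative assumptions, not recommendations from this note; in practice these checks are usually delegated to an existing exporter or probe.

```python
import shutil
import socket
import ssl
from datetime import datetime
from typing import List


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Return the 'notAfter' string of the certificate served by host:port."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days between `now` and the certificate's expiry time.

    `now` should be timezone-aware; a naive datetime is read as local time.
    """
    expires_at = ssl.cert_time_to_seconds(not_after)  # seconds since the epoch
    return (expires_at - now.timestamp()) / 86400


def disk_used_fraction(path: str = "/") -> float:
    """Fraction of the filesystem containing `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def needs_attention(days_left: float, used_fraction: float,
                    min_days: float = 14.0, max_used: float = 0.9) -> List[str]:
    """Return the list of conditions that breach the assumed thresholds."""
    problems = []
    if days_left < min_days:
        problems.append("certificate expires soon")
    if used_fraction > max_used:
        problems.append("storage pressure")
    return problems
```

A scheduled job could call `fetch_not_after` for each critical hostname, pass the result through `days_until_expiry` with `datetime.now(timezone.utc)`, and forward any non-empty result of `needs_attention` to whatever notification channel is in use.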

Best practices

  • Monitor both infrastructure health and service reachability
  • Alert on conditions that require action
  • Keep dashboards focused on questions operators actually ask
  • Use monitoring data to validate upgrades and incident recovery
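
One way to keep alerts tied to conditions that require action is to fire only when a breach persists across several consecutive samples, so that a single spike does not page anyone. The following is a minimal sketch of that idea; the function name and its parameters are hypothetical.

```python
from typing import Sequence


def sustained_breach(samples: Sequence[float], threshold: float,
                     min_consecutive: int) -> bool:
    """Return True if the most recent `min_consecutive` samples all exceed `threshold`.

    `samples` is ordered oldest to newest.
    """
    if min_consecutive <= 0:
        raise ValueError("min_consecutive must be positive")
    if len(samples) < min_consecutive:
        return False  # Not enough history to call the breach sustained
    return all(value > threshold for value in samples[-min_consecutive:])


if __name__ == "__main__":
    # Disk usage fractions sampled at a fixed interval (made-up values).
    spike = [0.50, 0.95, 0.60, 0.70]
    trend = [0.50, 0.92, 0.95, 0.97]
    print(sustained_breach(spike, threshold=0.9, min_consecutive=3))
    print(sustained_breach(trend, threshold=0.9, min_consecutive=3))
```

The trade-off is detection delay: a larger `min_consecutive` suppresses more transient noise but postpones the first notification by the same number of sampling intervals.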

Pitfalls

  • Treating dashboards as a substitute for alerts
  • Collecting far more data than anyone reviews
  • Monitoring only CPU and RAM while ignoring ingress, DNS, and backups
  • Sending noisy alerts that train operators to ignore them
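
Noisy alerts are often repeats of a condition that has already been reported. A possible mitigation, sketched below under the assumption that each alert carries a stable key, is to suppress repeat notifications for the same key during a cooldown period. The `AlertThrottle` class is hypothetical and keeps its state in memory only.

```python
from typing import Dict


class AlertThrottle:
    """Suppress repeat notifications for the same alert key within a cooldown."""

    def __init__(self, cooldown_seconds: float) -> None:
        self.cooldown_seconds = cooldown_seconds
        self._last_sent: Dict[str, float] = {}  # alert key -> time of last notification

    def should_notify(self, alert_key: str, now: float) -> bool:
        """Return True and record `now` if a notification should go out."""
        last = self._last_sent.get(alert_key)
        if last is not None and now - last < self.cooldown_seconds:
            return False  # Already notified recently; stay quiet
        self._last_sent[alert_key] = now
        return True
```

Passing `now` in explicitly (for example from `time.time()`) keeps the class easy to test. Suppressed repeats do not extend the cooldown, so a condition that stays broken is re-announced once per cooldown period and is not silenced indefinitely.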
